Freedom exploration

  1. Data exploration

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
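import matplotlib as plt  # alias for the top-level package, so plotting calls go through plt.pyplot
import matplotlib.pyplot  # make the plt.pyplot submodule available explicitly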
import seaborn as sns

In [2]:
filename = 'TABLE_III._Deaths_in_122_U.S._cities.csv'
df = pd.read_csv(filename)
# Keep only the first 1000 rows
df = df[:1000]

Data exploration


In [3]:
df.describe()


Out[3]:
       MMWR YEAR    MMWR WEEK  All causes, by age (years), All Ages**
count     1000.0  1000.000000                              943.000000
mean      2016.0     4.309000                              309.046660
std          0.0     2.210522                             1182.356938
min       2016.0     1.000000                                1.000000
25%       2016.0     2.000000                               43.000000
50%       2016.0     4.000000                               81.000000
75%       2016.0     6.000000                              175.000000
max       2016.0    13.000000                            12880.000000

       All causes, by age (years), >=65  All causes, by age (years), 45-64
count                        941.000000                         931.000000
mean                         210.873539                          70.964554
std                          806.080518                         269.965445
min                            1.000000                           1.000000
25%                           30.000000                           9.000000
50%                           57.000000                          19.000000
75%                          118.000000                          40.000000
max                         8816.000000                        2948.000000

       All causes, by age (years), 25-44  All causes, by age (years), 1-24
count                         843.000000                        652.000000
mean                           19.971530                          8.463190
std                            72.222615                         26.813233
min                             1.000000                          1.000000
25%                             2.000000                          1.000000
50%                             6.000000                          2.000000
75%                            12.000000                          6.000000
max                           797.000000                        260.000000

       All causes, by age (years), LT 1   P&I Total  Location 2
count                        606.000000  854.000000         0.0
mean                           7.374587   23.461358         NaN
std                           22.307541   85.209446         NaN
min                            1.000000    1.000000         NaN
25%                            1.000000    3.000000         NaN
50%                            2.000000    7.000000         NaN
75%                            5.000000   14.000000         NaN
max                          221.000000  934.000000         NaN

The data contain many NaN values, and it's not clear how best to treat them all. I will deal with the NaNs later.
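
The count row of Out[3] already hints at where the gaps are: 943 of the 1000 rows have an all-ages value, and Location 2 has none. One way to make this explicit is to count the missing values per column. The sketch below uses a small made-up frame (the values are placeholders, not the CDC data); on the real table the equivalent call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real table (values are illustrative only)
toy = pd.DataFrame({
    "Reporting Area": ["A", "B", "C", "D"],
    "Deaths": [10.0, np.nan, 30.0, np.nan],
    "Location 2": [np.nan] * 4,
})

# Number of missing values in each column
nan_counts = toy.isna().sum()
# Fraction of missing values in each column
nan_share = toy.isna().mean()

print(nan_counts)
print(nan_share)
```

A column that is entirely NaN, like Location 2 here, could be dropped outright, while partly missing columns need a case-by-case decision.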

Deaths per reporting area (all ages)


In [4]:
# Get a subsample of the dataset with deaths for all ages
# Do the report only for 2016 for simplicity
df_2016 = df[df['MMWR YEAR'] == 2016]
# Copy the selection so that renaming the columns does not touch df_2016
death_per_area = df_2016[['Reporting Area', 'All causes, by age (years), All Ages**']].copy()
death_per_area.columns = ['Reporting Area', 'Deaths']

# Drop rows with a missing death count; print the row count before and after
print(len(death_per_area))
death_per_area = death_per_area.dropna()
print(len(death_per_area))

# Keep the first 10 rows (the regional rows and the Total row); no sorting is applied
death_per_area = death_per_area[:10]
death_per_area.head(20)


1000
943
Out[4]:
   Reporting Area   Deaths
0     New England    600.0
1   Mid. Atlantic    807.0
2    E.N. Central   2468.0
3    W.N. Central    634.0
4     S. Atlantic   1402.0
5    E.S. Central   1230.0
6    W.S. Central   2167.0
7        Mountain   1460.0
8         Pacific   2021.0
9           Total  12789.0
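
The ten rows above are nine regional rows plus a `Total` row, so a bar plot of all ten puts a sum next to its parts. A minimal sketch, with the values copied from Out[4], checks that the regions add up to the total and sorts the regions in ascending order. Dropping the `Total` row before plotting is a suggestion here, not a step taken in this notebook.

```python
import pandas as pd

# Values copied from Out[4]
death_per_area = pd.DataFrame({
    "Reporting Area": ["New England", "Mid. Atlantic", "E.N. Central",
                       "W.N. Central", "S. Atlantic", "E.S. Central",
                       "W.S. Central", "Mountain", "Pacific", "Total"],
    "Deaths": [600.0, 807.0, 2468.0, 634.0, 1402.0,
               1230.0, 2167.0, 1460.0, 2021.0, 12789.0],
})

# Separate the aggregate row from the regional rows
is_total = death_per_area["Reporting Area"] == "Total"
regions = death_per_area[~is_total]

# Compare the sum of the regions with the reported total
regions_sum = regions["Deaths"].sum()
total = death_per_area.loc[is_total, "Deaths"].iloc[0]
print(regions_sum, total)

# Regions sorted by death count, smallest first
regions_sorted = regions.sort_values("Deaths")
print(regions_sorted)
```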

In [5]:
# This plot is slow to render
# Alternative way to initialize the matplotlib figure:
# f, ax = plt.pyplot.subplots(figsize=(15, 6))

# Set the context and increase the font size
sns.set_context("poster", font_scale=1.5)
# Create the figure
plt.pyplot.figure(figsize=(15, 4))
# Draw the bar plot; barplot returns the Axes object
ax = sns.barplot(x='Reporting Area', y='Deaths', data=death_per_area, palette="Blues_d")
# Set the axis labels and the title
ax.set(xlabel='Reporting Area', ylabel='Number of deaths', title="Deaths per area")
plt.pyplot.xticks(rotation=45)

# Show the plot (sns.plt is not available in recent seaborn releases)
plt.pyplot.show()



In [6]:
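# numeric_only=True skips text columns such as 'Reporting Area';
# recent pandas versions raise an error on them otherwise
df.mean(numeric_only=True)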


Out[6]:
MMWR YEAR                                 2016.000000
MMWR WEEK                                    4.309000
All causes, by age (years), All Ages**     309.046660
All causes, by age (years), >=65           210.873539
All causes, by age (years), 45-64           70.964554
All causes, by age (years), 25-44           19.971530
All causes, by age (years), 1-24             8.463190
All causes, by age (years), LT 1             7.374587
P&I Total                                   23.461358
Location 2                                        NaN
dtype: float64

In [7]:
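# Mean deaths per age category: positions 3 to 7 of the numeric column means
means = df.mean(numeric_only=True).values[3:8]
categories = ['>=65', '45-64', '25-44', '1-24', 'LT 1']
categories_ids = [1, 2, 3, 4, 5]
means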


Out[7]:
array([ 210.87353879,   70.96455424,   19.97153025,    8.46319018,
          7.37458746])
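
Slicing the means by position (`[3:8]`) depends on the column order. A hedged alternative is to select the age columns by name. The sketch below uses a toy frame with the same column names as the dataset; the values are made up for illustration.

```python
import pandas as pd

# Shared prefix of the age columns, as seen in Out[3]
prefix = "All causes, by age (years), "
age_labels = [">=65", "45-64", "25-44", "1-24", "LT 1"]

# Toy frame with the dataset's column names (values are illustrative only)
toy = pd.DataFrame({
    "MMWR YEAR": [2016, 2016],
    prefix + "All Ages**": [100.0, 200.0],
    prefix + ">=65": [60.0, 120.0],
    prefix + "45-64": [25.0, 45.0],
    prefix + "25-44": [8.0, 20.0],
    prefix + "1-24": [4.0, 10.0],
    prefix + "LT 1": [3.0, 5.0],
})

# Build the full column names and average only those columns
age_columns = [prefix + label for label in age_labels]
age_means = toy[age_columns].mean()
# Replace the long column names with the short labels for plotting
age_means.index = age_labels
print(age_means)
```

With this approach the labels and the values come from the same list, so they cannot drift apart if the table gains or loses a column.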

In [8]:
# Alternative way to initialize the matplotlib figure:
# f, ax = plt.pyplot.subplots(figsize=(15, 6))

# Set the context and increase the font size
sns.set_context("poster", font_scale=1.5)
# Create the figure
plt.pyplot.figure(figsize=(15, 4))
# Draw the bar plot; barplot returns the Axes object
ax = sns.barplot(x=categories, y=means, palette="Blues_d")
# Set the axis labels and the title
ax.set(xlabel='Age category', ylabel='Mean deaths', title="Deaths per age category")
# Show the plot (sns.plt is not available in recent seaborn releases)
plt.pyplot.show()


This plot shows the mean number of deaths per row of the table for each age category. As expected, the mean increases with age, from about 7 for infants under 1 year to about 211 for ages 65 and over.
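
The trend can also be checked numerically. The sketch below copies the five means from Out[7] and tests that they fall strictly from the oldest to the youngest category. Two caveats apply: the means are taken over all rows, including the regional and Total rows, and each column skips its own NaNs (941 non-missing rows for >=65 but only 606 for LT 1 in Out[3]), so the categories are not averaged over the same set of rows.

```python
import numpy as np

# Category labels and means copied from Out[7], ordered oldest to youngest
categories = [">=65", "45-64", "25-44", "1-24", "LT 1"]
means = np.array([210.87353879, 70.96455424, 19.97153025, 8.46319018, 7.37458746])

# Every step from one category to the next younger one should be negative
falls_with_youth = bool(np.all(np.diff(means) < 0))

# How much larger the oldest category's mean is than the youngest one's
ratio_oldest_to_youngest = means[0] / means[-1]

print(falls_with_youth)
print(ratio_oldest_to_youngest)
```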

